-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Work in progress: upgrade to astropy ecosystem structure #3
base: master
Are you sure you want to change the base?
Conversation
I've just added a first (very generously unrigorous) end-to-end test which you can run with
It runs the linear regression example used in the docs and checks that:
|
After a struggle, the package now passes its first continuous integration tests! See passing travis builds here, which include a format/style checker (via flake8), and an end-to-end functional test of the entire module (via pytest). 🎉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bmorris3,
Please to e-meet you as well :). I went quickly through your changes. It looks great.
I left some minor comments, specially regarding the names of a few functions/classes that we can discuss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bmorris3
I left a few more comments. My biggest concerns are about the necessity of the class Model
. I think we could get rid of it and just provide some auxiliary functions, since the main functionality is already in the RandomForestWrapper
, including saving/loading. We can discuss it in the next meeting on Wednesday, if you are around.
I can take care of the implementation details if you want. But maybe it's better to wait until the user interface is clear.
hela/model.py
Outdated
return ranges.T | ||
|
||
|
||
def generate_example_data(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is generating data with no noise and with independent slopes
and ints
, which makes the posterior computation pointless. I think generate_example_data
should build an example dataset that show the ability of estimating posteriors, which is the main strength of this library. Ideally a dataset with some correlation between slopes
and ints
that show up in the posterior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been trying to craft an example with correlations between slopes and ints but haven't been able to create a good working example. Would you be able to give me some more hints how to demonstrate this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bmorris3,
Sorry, I have realized that my comment was very vague and unclear. Something like this might work:
def generate_example_data():
"""
Generate an example dataset in the new directory ``linear_dataset``.
Returns
-------
example_dir : str
Path to the directory of the example data
training_dataset : str
Path to the dataset metadata JSON file
samples : `~numpy.ndarray`
"""
example_dir = 'linear_dataset'
training_dataset = os.path.join(example_dir, 'example_dataset.json')
os.makedirs(example_dir, exist_ok=True)
# Save the dataset metadata to a JSON file
dataset = {
"metadata": {
"names": ["slope", "intercept"],
"ranges": [[0, 1], [0, 1]],
"colors": ["#F14532", "#4a98c9"],
"num_features": 1000
},
"training_data": "training.npy",
"testing_data": "testing.npy"
}
with open(training_dataset, 'w') as fp:
json.dump(dataset, fp)
# Generate fake training data
npoints = 20000
slopes = np.random.uniform(size=npoints)
ints = np.random.uniform(size=npoints)
x = np.linspace(0, 1, 1000)[:, np.newaxis]
# Add correlated noise to parameters to introduce degeneracies
noise_ints = np.random.normal(scale=0.15, size=npoints)
noise_slopes = np.abs(noise_ints) + np.random.normal(scale=0.02, size=npoints)
# Add also noise to data points (not strictly necessary)
data = ((slopes + noise_slopes) * x + (ints + noise_ints) +
np.random.normal(scale=0.01, size=(1000, npoints)))
labels = np.vstack([slopes, ints])
X = np.vstack([data, labels])
# Split dataset into training and testing segments
training = X[:, :int(0.8 * npoints)].T
testing = X[:, int(0.8 * npoints):].T
np.save(os.path.join(example_dir, 'training.npy'), training)
np.save(os.path.join(example_dir, 'testing.npy'), testing)
# Generate a bunch of samples with a test value to "retrieve" with the
# random forest:
true_slope = 0.7
true_intercept = 0.5
samples = true_slope * x + true_intercept
return training_dataset, example_dir, samples.T[0]
The posteriors would look like this:
hela/model.py
Outdated
return np.array([forest_i.feature_importances_ for forest_i in forests]) | ||
|
||
|
||
class Model(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about how the user should use this class. Is it that important to have a class like this? As a user of an API, I would expect having more fine-grained control.
- It saves the model during training. While this makes sense in command like, a user now might want to train a predict in a single session, with no saving involved. I think the user should be responsible of printing/saving when and where they want. Of course, we could give auxiliary functions to help with saving and printing.
- It assumes the format of the dataset. The user might want to use a different format.
- It loads the saved model from a specific location every time
predict
is called, which makes no sense if the user does not need to save the model. - It only allows to predict from a single
data_file
given in the constructor.
The code from the train
method, for example, should be what the user might write to train a model and build the predicted_vs_real plots. But I don't see why we should give that as a method of a class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, we should definitely take a closer look at this class.
I think the user should be responsible of printing/saving when and where they want. Of course, we could give auxiliary functions to help with saving and printing.
I don't feel strongly about this either way. I think there may still be use cases where the saving is helpful – for example when the training takes a very long time to run. Whatever you think makes sense 👍
It assumes the format of the dataset. The user might want to use a different format.
I'm not sure I understand – can't the user supply arbitrary x
and y
(in the correct shapes)?
It only allows to predict from a single data_file given in the constructor.
Good point. Could the solution to all of the above points be to not expose this class to the user, and just use it internally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't even think we need this class internally. Currently we expect the user to use the library like this:
from hela import Retrieval
# Here, Retrieval requires that the user uses our JSON format to describe
# the dataset. This is what I meant by "assumes the format of the dataset".
# They need to write a JSON file linking to NPY files that have a specific
# organization (first columns are features, last columns are labels). This is very
# restrictive.
r = Retrieval("dataset/dataset.json", "output_dir")
# Train the random forest
r2scores = r.train(num_trees=1000, num_jobs=1)
fig = r.plot_correlations()
fig.savefig("corr.pdf")
plt.show()
data = np.load("query.npy")
# Currently r.predict always loads the model from disk,
# which might be unnecessary.
posterior = r.predict(data)
# Do something with the posterior...
I find this very limited, but also unclear, as many non-obvious, implicit operations are happening under the hood (saving the model and building R2 scores in train
, loading the model and printing percentiles in predict
...)
Instead, without Retrieval
, the user could do this:
import hela
# Load the dataset
# We offer this function for convenience, but this is optional. The user
# might write their own load_dataset, or read the data from NPY files.
dataset = hela.load_dataset("dataset/training_dataset.json")
# Train the model
# Again, this is a helper function. To have a more fine grained control, the user
# can create a PosteriorRandomForest and call the fit method.
forest = hela.train_model(dataset, num_trees=1000, num_jobs=1)
# Print some info info
print(forest.oob_score_)
...
# Save the model. Optional.
hela.save("output/model.pkl", forest)
# Run validation
dataset = hela.load_dataset("dataset/testing_dataset.json")
pred, r2scores = hela.test_model(forest, dataset)
# Plot
fig = hela.plot_predicted_vs_real(dataset.y, pred, dataset.names, dataset.ranges)
fig.savefig("output/r2.pdf")
# Prediction
data = np.load("query.npy")
# No load required here, but we can load a model with forest = hela.load("output/model.pkl")
posterior = forest.predict_posterior(data)
# or hela.predict_posterior(forest, data) to be consistent with other functions.
# Do something with the posterior...
posterior_ranges = hela.posterior_percentiles(posterior) # or posterior.percentiles()
I think this approach gives complete control to the user, and at the same time it is more clear, as every operation is explicit and only happens if the user asks for it.
If you are fine with the second approach and it suits the astropy developer standards, we can remove the class Retrieval
and prepare the collection of helper functions (save
, load
...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long delay in response before I got back to you on this, it's been a busy few weeks!
I see the advantages to the above approach, but I'm still searching for a more object-oriented project structure and workflow. Let me see if I can split the difference between our proposals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest refactor looks like:
import matplotlib.pyplot as plt
from hela import Retrieval, Dataset, generate_example_data
# Create an example dataset
training_dataset, example_dir, example_data = generate_example_data()
# Load the dataset
dataset = Dataset.load_json("linear_dataset/example_dataset.json")
# Train the model
r = Retrieval()
r2scores = r.train(dataset, num_trees=1000, num_jobs=5)
# Print OOB score (optional)
print(r.oob)
# Plot predicted vs real:
fig, ax = r.plot_predicted_vs_real(dataset)
# Save the model (optional)
# from hela import save_model
# save_model("linear_dataset/model.pkl", r.model)
# Predict posterior distribution for slope and intercept of example data
posterior = r.predict(example_data)
# Print posterior ranges (optional)
posterior.print_percentiles(dataset.names)
fig2, ax2 = posterior.plot_posterior_matrix(dataset)
plt.show()
Is that a sufficient compromise between the explicit function calls you're looking for and the object-oriented structure I'm looking for? Again, happy to iterate 👍
Hey @pmneila – perhaps we can move forward with this PR, if you have a minute? |
Hi @bmorris3 Sure, let's finish this. BTW, Following some requests from @cefisher, I've implemented some additional functionality into HELA. However, this PR was created before those changes were implemented. How do you think we should proceed? Do we finish with this PR with no extra functionality and then we integrate it in subsequent PRs? I think that's the best option. |
I'm happy to go with your best judgement! Would that be much extra work for you? I don't know how big the merge conflicts are. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The merge conflicts are going to be huge. 😅 After going through the code, I still think the best solution is to continue working on this PR and then I'll integrate the new functionality following this new API as much as possible.
There are a few things in the new code that you might find relevant at this point:
- It will be necessary to pass the training data at prediction time (so we can compute things like the best-fit model (@cefisher can give you more details on this if you want)). One of the consequences of this is that we won't need to store names, ranges and colors in the
RandomForestPosterior
, since they will be available at test time in the dataset. I think this is actually much better. I found it very weird that the forest should save that information. - I split the classes
PosteriorRandomForest
(oldModel
) andPosterior
into two different files. I don't think this is problem, but maybe you want to keep everything inposteriors.py
? - I added type annotations. Not sure if that might be a problem with Astropy?
I went through the code and I think we are almost there. I added a few comments below.
@@ -18,7 +18,7 @@ | |||
POSTERIOR_MAX_SIZE = 10000 | |||
|
|||
|
|||
def plot_predicted_vs_real(y_real, y_pred, names, ranges, alpha='auto'): | |||
def plot_predicted_vs_real(dataset, retrieval, alpha='auto'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
retrieval
is only used to get the prediction. Pass the prediction instead.
@@ -179,6 +180,46 @@ def __init__(self, samples, weights): | |||
self.samples = samples | |||
self.weights = weights | |||
|
|||
def print_percentiles(self, names): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function prints to the standard output. Code is much simpler if all functions with side effects (such as I/O) are present only in the high-level classes of the code (in this case, in the class Retrieval
). Posterior
is a low-level class, used internally to represent posteriors. A method like print_percentiles
makes sense only in the context of the class Retrieval
, but it is very odd for any other user of this class.
I understand that what you wanted to to is to allow the final user to do something like:
posterior = retrieval.predict(example_data)
posterior.print_percentiles(dataset.names)
I can think of two possible solutions. The simplest one is changing the interface to something like:
posterior = retrieval.predict(example_data)
retrieval.print_percentiles(dataset.names, posterior)
This also is coherent with the fact that all other functions for printing/plotting are in Retrieval
.
The other solution, if you really want to keep the original API, would be encapsulating the posterior in a more specific class of posteriors for retrieval (RetrievalPosterior
?) that contains all the methods that make sense in the context of the class Retrieval
.
[values[0], values[2] - values[0], values[0] - values[1]]) | ||
return ranges.T | ||
|
||
def plot_posterior_matrix(self, dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in print_percentiles
, this function performs I/O in the class Posterior
. Same things apply here.
Also, this function leads to a circular dependency in the code, since posteriors.py
now depends on plot.py
and vice versa.
"[+{:.3g} -{:.3g}]".format(name_i, *pred_range_i)) | ||
|
||
def data_ranges(self, percentiles=(50, 16, 84)): | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is fine, but it seems a bit strange that this method is in the class Posterior
and the function resample_posterior
isn't. Rather put them both as separated functions or both as methods of this class? I have a slight preference towards independent functions and against classes, but if you prefer methods it'll be fine for me.
|
||
# saving model information... | ||
self.oob = self.model.rf.oob_score_ | ||
|
||
pred, r2scores = test_model(self.model, self.dataset) | ||
pred, r2scores = test_model(self.model, dataset) | ||
self.pred = pred | ||
return r2scores |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we also return pred
? Not saying that we must do it, but if we return r2scores, it seems natural.
Hi @pmneila and @cefisher,
I'm a new postdoc working with Kevin Heng, nice to e-meet you @pmneila 👋 . One of my projects is to bring ESP services up to compliance with astropy developer standards, and
HELA
is the first toolkit I'm starting to work with. I hope for this PR to be the beginning of a conversation about tweaks we can make to the package. Ideally we can collaborate over GitHub, but I'm happy to meet in person if that would be helpful.This PR is a modest overhaul of the formatting within HELA, and a reorganization of the package structure. There are a few broad goals of the PR:
I've moved a copy of the present implementation and README of HELA into a directory called
legacy
, so the old functionality is preservedNew API
The API now allows for a workflow that looks like this:
The
train
andpredict
methods on theRandomForest
object work just like therfretrieval.py train
andrfretrieval.py predict
functions worked in the legacy version.Documentation
There's now a docs directory which contains sphinx self-building documentation which you can build locally on your copy of this branch with
The narrative documentation starts with an example that @cefisher and I worked through today where we demonstrate fitting a line (by predicting the slope and intercept parameters using a random forest) implemented by HELA.
GitHub integrations/webhooks
After this PR is merged, we can update the GitHub webhooks/integrations to get travis and readthedocs up and running.